First, I’d like to note that I will take a somewhat unconventional approach in this report. Because interaction between stakeholders is limited to the information stated in the Google Data Analytics Capstone Case Study course, I cannot ask the stakeholders direct questions to clarify the purpose of the analysis or the data itself. I had questions about the fictitious bike company Cyclistic and its data, which are based on the bike-share system in the city of Chicago.
I therefore did some of my own research on the Divvy bike system in Chicago to address those curiosities and concerns. Some of my comments in this report are written from inside the role-play/case study, while others will be “meta” comments.
Second, this report is an amalgamation of two separate analyses. Because I could not interact with the actual stakeholders, I found it worthwhile to expand beyond the original constraints set out in the instructions. In the first analysis, I looked at data, grouped quarterly, from April 2019 to March 2020 (as the original R script prompted). That initial analysis raised many questions, so I turned to my own research. On the Divvy website, I discovered some key information about the pricing system and the types of bikes, and I finished the initial analysis with more questions about the business task at hand.
My research found that the city introduced electric bikes around July/August of 2020 with the aim of increasing ridership. Given that the COVID-19 pandemic had just started at the beginning of 2020, I concluded that it, too, might impact ridership in a major way. I thought it relevant to include more data to capture the effects of these two key factors, in addition to some new benchmarks I had established. In the second analysis, I looked at monthly data from June 2020 to May 2021. I then combined the two data sets and added more data (April/May 2020, June 2021) into one larger analysis, for a total of 27 months of data.
Third, the original R-script was created by Kevin Hartman (Hartman) and was included in the case study assignment. For the sake of this case study, I will assume he is a member of my data analytics team.
Finally, the analysis and conclusions presented here are the results of my combined data analyses, built upon the original R script by Hartman.
The fictional Cyclistic bike-share program is based on the Divvy bike-share system in Chicago. The company is looking to increase the number of annual subscribers, and it is my role to see how riders who have a membership differ from those who don’t, and how the company can boost subscriber numbers through social media.
The following is a detailed summary of the case study taken from the Capstone assignment document, modified slightly for brevity.
The company Cyclistic is a bike-share program with more than 5,800 bicycles and 600 docking stations. The bikes can be unlocked from one station and returned to any other station in the system anytime. The service sets itself apart from other services by offering various other bikes for people with disabilities, but the majority of riders (92%) opt for traditional bikes. It has flexible pricing plans: single-ride or full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
The stakeholders are Lily Moreno, the director of marketing and my manager; the marketing analytics team; and the executive team. The executive team is detail-oriented and will make the final decisions. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program, including social media. My team and I are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
Marketing strategy had relied on building general awareness and appealing to broad consumer segments, and the pricing flexibility helped Cyclistic attract more customers. Now, Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders, and Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of and have chosen the program.
The goal is clear: design marketing strategies aimed at converting casual riders into annual members. To do that, however, the marketing analytics team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect marketing tactics.
Three questions will guide the future marketing program:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
As stated above in the case-study brief, the assignment clearly sets up the situation and my role in the analysis. I am working with my data analytics team, so for the sake of this role play, I am acting as if Kevin Hartman is a part of my team, and that he performed some initial analysis and wrote some R-code. Therefore, it is assumed that he helped me with this report. His contribution is duly noted here. Furthermore, I am to report directly to Lily Moreno and prepare my findings for the executive team.
With this in mind:
My goal is to make recommendations on how to convert casual users to annual users through the influence of digital media.
My business task is to determine how annual members and casual riders use the bikes differently, in order to answer why a casual rider would purchase an annual membership.
This is the starting point of my analysis.
The data in this analysis was downloaded directly from the Divvy website through AWS. It follows the principles of being reliable, original, comprehensive, current, and cited data. In other words, it ROCCCs!
It is made available by Motivate International Inc. under this license found here. Copies of the data are stored on my computer for use in my analysis. Proper data ethics are observed through my compliance with this license. In short, I am using it for the sole purpose of analysis in the case study, and not for commercial purposes. In addition, I am not conducting data mining nor trying to correlate data with names or other information of customers. Lastly, I am in no way affiliated with, approved, sponsored, or endorsed by Motivate International Inc.
After a short struggle with the data in spreadsheets, it quickly became apparent that, given the size of the data, I would need to use R to perform my analysis. Since the data is open and readily available, I downloaded and stored copies on my own computer for use with RStudio. I was then ready for a quick look at the data.
I had some initial questions in the back of my mind, and after a quick look at the data I had more, so I wanted to establish some metrics for the analysis.
Because I cannot ask questions directly of any stakeholder, I decided to research the organization the case study is based on: https://www.divvybikes.com. I also found this report by the city of Chicago useful. Here are some key takeaways:
Based on this cost structure, it would be useful to find out how often non-members use the system compared with members: that is, how many single-ride trips they take in a year and/or how many day passes they purchase in a year.
On a per-month basis, the annual membership averages out to slightly less than the cost of three single-ride trips per month. Time-wise, three trips on an annual membership give a total of 8,100 seconds (135 mins) versus 5,400 seconds (90 mins) for three single-ride trips, with a $0.90 monthly saving. At a total cost of $108, the annual membership is equivalent to 32.7 single-ride trips per year. Factoring in the unlimited 45-minute rides, it is far better to be a member than to pay for single rides once you cross the 32-ride threshold in a year, even with the upfront payment. Regular single-ride users should be purchasing a membership, especially with D4E, where a reduced membership fee is available under certain conditions.
Comparing single rides with day passes: a $15 day pass gives unlimited rides in three-hour chunks, whereas a single three-hour trip paid in single-ride fees would cost $19.80. Four shorter single-ride trips in one day, or getting on and off the bike twice, would cost $13.20. It is easy to see that anyone intending to stop and go more than twice in a day, or to take longer rides, would be better off buying a day pass.
For a casual customer who predominantly uses the system for this kind of unlimited daily travel, an annual pass is equivalent to 7.2 day passes in a year. Even with the trade-off of shorter ride times (annual members pay by the minute beyond 45 minutes), it seems reasonable to buy an annual membership if you use day passes on weekends, or on other occasions, a couple of times a month. Likewise, for casual users who buy around 30 single rides per year, or roughly five per month over half a year, an annual membership would be financially viable. Any current customer or potential rider who fits either of these two categories just needs to be shown the cost-benefit calculus of buying an annual membership up front.
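The break-even arithmetic above can be sketched in a few lines of R. The prices are the ones I found in my research at the time of writing (single ride, day pass, and annual membership); they are assumptions that should be verified against the current Divvy pricing page before reuse.

```r
# Assumed prices from my Divvy research (verify on divvybikes.com):
single_ride <- 3.30    # one 30-minute single ride
day_pass    <- 15.00   # 24 hours of unlimited 3-hour rides
annual      <- 108.00  # annual membership: unlimited 45-minute rides

# Break-even thresholds for an annual membership
annual / single_ride   # ~32.7 single rides per year
annual / day_pass      # 7.2 day passes per year

# A single three-hour trip paid in 30-minute single-ride increments
6 * single_ride        # $19.80, versus a $15 day pass
```

These are the thresholds (roughly 33 single rides or 8 day passes per year) that anchor the benchmarks used later in the analysis.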
One hesitation about buying an annual membership may be the climate. Many people in Chicago are likely discouraged from riding in cold weather, so there are probably riders who would only ride from May to October, when the weather is favorable. They may view an annual membership as a waste of money given that they wouldn’t use it for half the year. But this is the wrong way to look at it.
For the success of the campaign, it is vital to market to the casual single-riders who take advantage of the bikes a couple of times per week when the sun is out, and to day-pass users who may only ride on weekends in the warmer weather. For customers within the bounds of those benchmarks, even if they ride only in the warmer months, the cost-benefit of an annual membership outweighs continuing to purchase non-member rides. For a casual rider who mixes day passes and single rides each month in the warmer season, or at any time of year for that matter, the trade-off of an upfront membership for unlimited 45-minute rides should be an easy decision: even rides taken only on Saturdays or Sundays in the warmer months make the membership the better financial option in the long run. And once the pass is in hand, riders would be more inclined to use it at any time, creating a bigger windfall.
If it were possible to get details on repeat casual users and how they use the system, marketing to those individuals could be better targeted. However, the case study assignment sheet states that “data-privacy issues prohibit you from using riders’ personally identifiable information…to connect pass purchases…to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.” It may therefore not be possible to gather this information, and I will have to rely on the duration of trips and the total number of trips, rather than the number of trips an individual user takes, to analyze the data. In a real-life situation, I would consult with the data team to see whether there was any way to obtain this information.
Therefore, in addition to the questions outlined earlier, some additional questions are:
With these new goals and business task in mind, I was ready to process the data.
I started with the original R script by Hartman, available at https://artscience.blog/home/divvy-dataviz-case-study, as the launching point for my analysis. All code is a modification and expansion of that original script. Some code and results are excluded or hidden for readability.
I used tidyverse, lubridate, ggplot2, and skimr (code excluded) for processing and analysis.
I read in the four data sets: Divvy_Trips_q2_2019.csv to Divvy_Trips_q1_2020.csv (code excluded)
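Since the read-in code is excluded above, here is a minimal sketch of that step, assuming the four CSVs sit in the working directory under the file names described (adjust paths to wherever the Divvy downloads are stored):

```r
library(tidyverse)  # loads readr for read_csv()

# Assumed file names, following the pattern of the downloads described above
quarters <- c("q2_2019", "q3_2019", "q4_2019", "q1_2020")
files <- paste0("Divvy_Trips_", quarters, ".csv")

# Read each CSV into its own data frame (q2_2019, ..., q1_2020)
for (i in seq_along(files)) {
  assign(quarters[i], read_csv(files[i]))
}
```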
I examined the structure of the data using the skimr and dplyr packages to get a detailed look at it.
glimpse(q1_2020)
glimpse(q4_2019)
glimpse(q3_2019)
glimpse(q2_2019)
skim_without_charts(q1_2020)
skim_without_charts(q4_2019)
skim_without_charts(q3_2019)
skim_without_charts(q2_2019)
The key findings for quarterly data are:
All of the findings above and the modifications below were done in accordance with the analysis by Hartman.
I renamed all columns to be consistent with q1_2020 data and beyond.
I converted ride_id and rideable_type to character type in order to bind the data frames.
I created a data frame: all_trips_quarter
I removed start and end station ID and lat./long. data, as well as gender and birthday fields, which were only in the q2_2019 data.
(q4_2019 <- rename(q4_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
(q3_2019 <- rename(q3_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
(q2_2019 <- rename(q2_2019
,ride_id = "01 - Rental Details Rental ID"
,rideable_type = "01 - Rental Details Bike ID"
,started_at = "01 - Rental Details Local Start Time"
,ended_at = "01 - Rental Details Local End Time"
,start_station_name = "03 - Rental Start Station Name"
,start_station_id = "03 - Rental Start Station ID"
,end_station_name = "02 - Rental End Station Name"
,end_station_id = "02 - Rental End Station ID"
,member_casual = "User Type"))
q4_2019 <- mutate(q4_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
q3_2019 <- mutate(q3_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
q2_2019 <- mutate(q2_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
all_trips_quarter <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020)
all_trips_quarter <- all_trips_quarter %>%
select(-c(start_station_id, end_station_id, start_lat, start_lng, end_lat, end_lng, birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped",
"05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))
glimpse(all_trips_quarter)
skim_without_charts(all_trips_quarter)
table(all_trips_quarter$member_casual)
I recoded “Customer” and “Subscriber” to “casual” and “member” to be consistent with the q1_2020 data. A note on this nomenclature was mentioned by Hartman.
In my research, according to the report by the city of Chicago mentioned earlier and found here, ebikes were introduced in July of 2020, and the data analysis following this first part will show this result as well. I therefore concluded that docked_bike here refers to what will later be classified as classic_bike (a pedal bike), and I decided to transform all values in rideable_type from a bike_id number to docked_bike, to keep with this nomenclature. If this were not a role-play, I would confirm this assumption with a stakeholder.
I standardized the bike type to one value: docked_bike.
table(all_trips_quarter$member_casual)
all_trips_quarter <- all_trips_quarter %>%
mutate(member_casual = recode(member_casual
,"Subscriber" = "member"
,"Customer" = "casual"))
# Change all former bike_id to "docked_bike"
all_trips_quarter$rideable_type <- c("docked_bike")
which(all_trips_quarter$rideable_type != "docked_bike")
Here is a summary of the final data for all_trips_quarter to confirm this step in the completion of data cleaning.
glimpse(all_trips_quarter)
## Rows: 3,879,822
## Columns: 7
## $ ride_id <chr> "22178529", "22178530", "22178531", "22178532", "22…
## $ started_at <dttm> 2019-04-01 00:02:22, 2019-04-01 00:03:02, 2019-04-…
## $ ended_at <dttm> 2019-04-01 00:09:48, 2019-04-01 00:20:30, 2019-04-…
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ start_station_name <chr> "Daley Center Plaza", "Wood St & Taylor St", "LaSal…
## $ end_station_name <chr> "Desplaines St & Kinzie St", "Wabash Ave & Roosevel…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
unique(all_trips_quarter$member_casual)
## [1] "member" "casual"
unique(all_trips_quarter$rideable_type)
## [1] "docked_bike"
I read in the fifteen data sets: 202004-divvy-tripdata.csv to 202106-divvy-tripdata.csv (code excluded)
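As with the quarterly data, the read-in code is excluded; a sketch of it, assuming the file names follow the `YYYYMM-divvy-tripdata.csv` pattern named above and sit in the working directory, could generate the fifteen paths programmatically (in the actual script, each frame was assigned a name like april_2020 individually):

```r
library(tidyverse)  # readr for read_csv(), purrr for map()

# Build the fifteen assumed file names, 202004-divvy-tripdata.csv ... 202106-divvy-tripdata.csv
months <- format(seq(as.Date("2020-04-01"), as.Date("2021-06-01"), by = "month"), "%Y%m")
files  <- paste0(months, "-divvy-tripdata.csv")

# Read all fifteen CSVs into a named list of data frames
monthly <- set_names(map(files, read_csv), months)
```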
I examined the structure of the data to get a detailed look at it.
glimpse(april_2020)
glimpse(may_2020)
glimpse(june_2020)
glimpse(july_2020)
glimpse(aug_2020)
glimpse(sept_2020)
glimpse(oct_2020)
glimpse(nov_2020)
glimpse(dec_2020)
glimpse(jan_2021)
glimpse(feb_2021)
glimpse(march_2021)
glimpse(april_2021)
glimpse(may_2021)
glimpse(june_2021)
skim_without_charts(april_2020)
skim_without_charts(may_2020)
skim_without_charts(june_2020)
skim_without_charts(july_2020)
skim_without_charts(aug_2020)
skim_without_charts(sept_2020)
skim_without_charts(oct_2020)
skim_without_charts(nov_2020)
skim_without_charts(dec_2020)
skim_without_charts(jan_2021)
skim_without_charts(feb_2021)
skim_without_charts(march_2021)
skim_without_charts(april_2021)
skim_without_charts(may_2021)
skim_without_charts(june_2021)
The key findings for the monthly data are:
As mentioned in the previously cited report, “classic pedal bikes will continue to be parked at Divvy docking stations only [while] ebikes can be parked at Divvy docking stations, Divvy E-stations, or any public bike rack.” The report also mentions that “Divvy E-stations are clusters of bike racks.” I also examined the Divvy Bike Map, which shows stations with docks that have no ebikes, as well as ebike-exclusive stations noting, “Parking for pedal bikes can be found at the docks icons on your map,” suggesting that docks refer to pedal bikes. Therefore, for the sake of this report, I will assume that from December 2020 onward, when there are three types of bikes, docked_bike continues to refer to a type of pedal bike. Still, considering that ebikes can be docked at any docking station, this is not 100% certain. Please keep this in mind: I cannot establish it with absolute certainty within the limitations of the case study, and in a real situation I would confirm with a member of the Divvy team before proceeding.
Here are three summaries of the monthly data to show the differences in the data outlined above.
glimpse(april_2020)
## Rows: 84,776
## Columns: 13
## $ ride_id <chr> "A847FADBBC638E45", "5405B80E996FF60D", "5DD24A79A4…
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ started_at <dttm> 2020-04-26 17:45:14, 2020-04-17 17:08:54, 2020-04-…
## $ ended_at <dttm> 2020-04-26 18:12:03, 2020-04-17 17:17:03, 2020-04-…
## $ start_station_name <chr> "Eckhart Park", "Drake Ave & Fullerton Ave", "McClu…
## $ start_station_id <dbl> 86, 503, 142, 216, 125, 173, 35, 434, 627, 377, 508…
## $ end_station_name <chr> "Lincoln Ave & Diversey Pkwy", "Kosciuszko Park", "…
## $ end_station_id <dbl> 152, 499, 255, 657, 323, 35, 635, 382, 359, 508, 37…
## $ start_lat <dbl> 41.8964, 41.9244, 41.8945, 41.9030, 41.8902, 41.896…
## $ start_lng <dbl> -87.6610, -87.7154, -87.6179, -87.6975, -87.6262, -…
## $ end_lat <dbl> 41.9322, 41.9306, 41.8679, 41.8992, 41.9695, 41.892…
## $ end_lng <dbl> -87.6586, -87.7238, -87.6230, -87.6722, -87.6547, -…
## $ member_casual <chr> "member", "member", "member", "member", "casual", "…
glimpse(july_2020)
## Rows: 551,480
## Columns: 13
## $ ride_id <chr> "762198876D69004D", "BEC9C9FBA0D4CF1B", "D2FD8EA432…
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ started_at <dttm> 2020-07-09 15:22:02, 2020-07-24 23:56:30, 2020-07-…
## $ ended_at <dttm> 2020-07-09 15:25:52, 2020-07-25 00:20:17, 2020-07-…
## $ start_station_name <chr> "Ritchie Ct & Banks St", "Halsted St & Roscoe St", …
## $ start_station_id <dbl> 180, 299, 329, 181, 268, 635, 113, 211, 176, 31, 14…
## $ end_station_name <chr> "Wells St & Evergreen Ave", "Broadway & Ridge Ave",…
## $ end_station_id <dbl> 291, 461, 156, 94, 301, 289, 140, 31, 191, 142, 31,…
## $ start_lat <dbl> 41.90687, 41.94367, 41.93259, 41.89076, 41.91172, 4…
## $ start_lng <dbl> -87.62622, -87.64895, -87.63643, -87.63170, -87.626…
## $ end_lat <dbl> 41.90672, 41.98404, 41.93650, 41.91831, 41.90799, 4…
## $ end_lng <dbl> -87.63483, -87.66027, -87.64754, -87.63628, -87.631…
## $ member_casual <chr> "member", "member", "casual", "casual", "member", "…
glimpse(dec_2020)
## Rows: 131,573
## Columns: 13
## $ ride_id <chr> "70B6A9A437D4C30D", "158A465D4E74C54A", "5262016E0F…
## $ rideable_type <chr> "classic_bike", "electric_bike", "electric_bike", "…
## $ started_at <dttm> 2020-12-27 12:44:29, 2020-12-18 17:37:15, 2020-12-…
## $ ended_at <dttm> 2020-12-27 12:55:06, 2020-12-18 17:44:19, 2020-12-…
## $ start_station_name <chr> "Aberdeen St & Jackson Blvd", NA, NA, NA, NA, NA, N…
## $ start_station_id <chr> "13157", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ end_station_name <chr> "Desplaines St & Kinzie St", NA, NA, NA, NA, NA, NA…
## $ end_station_id <chr> "TA1306000003", NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat <dbl> 41.87773, 41.93000, 41.91000, 41.92000, 41.80000, 4…
## $ start_lng <dbl> -87.65479, -87.70000, -87.69000, -87.70000, -87.590…
## $ end_lat <dbl> 41.88872, 41.91000, 41.93000, 41.91000, 41.80000, 4…
## $ end_lng <dbl> -87.64445, -87.70000, -87.70000, -87.70000, -87.590…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
unique(april_2020$rideable_type)
## [1] "docked_bike"
unique(july_2020$rideable_type)
## [1] "docked_bike" "electric_bike"
unique(dec_2020$rideable_type)
## [1] "classic_bike" "electric_bike" "docked_bike"
I converted all pre-Dec 2020 station ID data to character type in order to be able to bind the data.
I created a data frame: all_trips_monthly
I removed start and end station ID and lat./long. data
april_2020 <- mutate(april_2020, start_station_id = as.character(start_station_id)
, end_station_id = as.character(end_station_id))
may_2020 <- mutate(may_2020, start_station_id = as.character(start_station_id)
, end_station_id = as.character(end_station_id))
june_2020 <- mutate(june_2020, start_station_id = as.character(start_station_id)
, end_station_id = as.character(end_station_id))
july_2020 <- mutate(july_2020, start_station_id = as.character(start_station_id)
, end_station_id = as.character(end_station_id))
aug_2020 <- mutate(aug_2020, start_station_id = as.character(start_station_id)
, end_station_id = as.character(end_station_id))
sept_2020 <- mutate(sept_2020, start_station_id = as.character(start_station_id)
, end_station_id = as.character(end_station_id))
oct_2020 <- mutate(oct_2020, start_station_id = as.character(start_station_id)
, end_station_id = as.character(end_station_id))
nov_2020 <- mutate(nov_2020, start_station_id = as.character(start_station_id)
, end_station_id = as.character(end_station_id))
all_trips_monthly <- bind_rows(april_2020, may_2020, june_2020, july_2020,
aug_2020, sept_2020, oct_2020, nov_2020,
dec_2020, jan_2021, feb_2021, march_2021,
april_2021, may_2021, june_2021)
all_trips_monthly <- all_trips_monthly %>%
select(-c(start_station_id, end_station_id, start_lat, start_lng, end_lat, end_lng))
glimpse(all_trips_monthly)
skim_without_charts(all_trips_monthly)
Here is a summary of the final data for all_trips_monthly to confirm this step in the completion of data cleaning.
Note: There is a slight difference (209) between the number of rows and the number of unique ride_ids. The ride_ids were unique within each month, so I am not concerned about this small number affecting the results.
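The duplicate count can be confirmed directly from the combined frame; a sketch (assuming all_trips_monthly has been built as above):

```r
# Number of duplicated ride_ids in the combined monthly frame; this should
# equal the gap between total rows and unique ids noted above
sum(duplicated(all_trips_monthly$ride_id))
nrow(all_trips_monthly) - length(unique(all_trips_monthly$ride_id))

# To inspect the offending rows (dplyr is already loaded via tidyverse):
# all_trips_monthly %>% group_by(ride_id) %>% filter(n() > 1) %>% arrange(ride_id)
```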
glimpse(all_trips_monthly)
## Rows: 5,088,206
## Columns: 7
## $ ride_id <chr> "A847FADBBC638E45", "5405B80E996FF60D", "5DD24A79A4…
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ started_at <dttm> 2020-04-26 17:45:14, 2020-04-17 17:08:54, 2020-04-…
## $ ended_at <dttm> 2020-04-26 18:12:03, 2020-04-17 17:17:03, 2020-04-…
## $ start_station_name <chr> "Eckhart Park", "Drake Ave & Fullerton Ave", "McClu…
## $ end_station_name <chr> "Lincoln Ave & Diversey Pkwy", "Kosciuszko Park", "…
## $ member_casual <chr> "member", "member", "member", "member", "casual", "…
nrow(all_trips_monthly)
## [1] 5088206
length(unique(all_trips_monthly$ride_id))
## [1] 5087997
I checked that the column names of the quarterly and monthly data frames matched, and then bound them together.
I created a combined data frame: all_trips
colnames(all_trips_quarter)
colnames(all_trips_monthly)
all_trips <- bind_rows(all_trips_quarter, all_trips_monthly)
skim_without_charts(all_trips)
Here is a quick confirmation of the combined data frame.
glimpse(all_trips)
## Rows: 8,968,028
## Columns: 7
## $ ride_id <chr> "22178529", "22178530", "22178531", "22178532", "22…
## $ started_at <dttm> 2019-04-01 00:02:22, 2019-04-01 00:03:02, 2019-04-…
## $ ended_at <dttm> 2019-04-01 00:09:48, 2019-04-01 00:20:30, 2019-04-…
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ start_station_name <chr> "Daley Center Plaza", "Wood St & Taylor St", "LaSal…
## $ end_station_name <chr> "Desplaines St & Kinzie St", "Wabash Ave & Roosevel…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
I added columns that list the date, month, day, and year of each ride and checked the data. As Hartman noted in his R script, the original data “at the ride level..is too granular.”
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%a")
glimpse(all_trips)
Trip duration was dropped from the original data in q1_2020, and I removed that column from the 2019 data earlier.
I recalculated trip duration and created a new metric in a column called ride_length that shows the duration of each trip.
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
skim_without_charts(all_trips$ride_length)
I converted the data to numeric type in order to perform calculations on the data.
is.factor(all_trips$ride_length)
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
glimpse(all_trips$ride_length)
skim_without_charts(all_trips)
As Hartman noted, the data frame includes a few hundred entries where bikes were taken out of docks and checked for quality by Divvy, or where ride_length was negative.
I confirmed this by examining the data to find negative durations and a station named ‘HQ QR’, as seen below.
all_trips %>%
count(ride_length < 0)
## # A tibble: 2 x 2
## `ride_length < 0` n
## <lgl> <int>
## 1 FALSE 8957334
## 2 TRUE 10694
all_trips %>%
count(start_station_name == "HQ QR")
## # A tibble: 3 x 2
## `start_station_name == "HQ QR"` n
## <lgl> <int>
## 1 FALSE 8682193
## 2 TRUE 3767
## 3 NA 282068
I created a new version of the data frame (v2) since data is being removed, as per Hartman’s advice. A quick confirmation and summary is included.
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
glimpse(all_trips_v2)
## Rows: 8,953,684
## Columns: 13
## $ ride_id <chr> "22178529", "22178530", "22178531", "22178532", "22…
## $ started_at <dttm> 2019-04-01 00:02:22, 2019-04-01 00:03:02, 2019-04-…
## $ ended_at <dttm> 2019-04-01 00:09:48, 2019-04-01 00:20:30, 2019-04-…
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ start_station_name <chr> "Daley Center Plaza", "Wood St & Taylor St", "LaSal…
## $ end_station_name <chr> "Desplaines St & Kinzie St", "Wabash Ave & Roosevel…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
## $ date <date> 2019-04-01, 2019-04-01, 2019-04-01, 2019-04-01, 20…
## $ month <chr> "04", "04", "04", "04", "04", "04", "04", "04", "04…
## $ day <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01…
## $ year <chr> "2019", "2019", "2019", "2019", "2019", "2019", "20…
## $ day_of_week <chr> "Mon", "Mon", "Mon", "Mon", "Mon", "Mon", "Mon", "M…
## $ ride_length <dbl> 446, 1048, 252, 357, 1007, 257, 548, 383, 2137, 212…
The data is now ready for analysis.
My main focus was how member and casual riders differ on two main measures: duration and bike type. I wanted to see how long riders used the bikes relative to my benchmarks, and whether the introduction of ebikes increased ridership.
In the original analysis following Hartman (the quarterly data set), I calculated some basic statistics for duration (ride_length), then aggregated duration by user type (member_casual), and then aggregated further by day of week with some accompanying charts. I followed the same process for the monthly data in the second analysis, where I also made some additional visualizations. To get a better picture, I aggregated by bike type (rideable_type) and year, by user type and month, by user type, month, and year, and by bike type and month from June 2020 to the end of the data set.
The code below shows the original R-script with Hartman’s comments plus my additional code. It is for the combined data set and is here to demonstrate my thinking process. The results and visualizations are not shown.
summary(all_trips_v2$ride_length)
# Compare members and casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
# See the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
# Notice that the days of the week are out of order. Let's fix that.
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
# Now, let's run the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
# analyze ridership data by type and weekday
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday) %>% #groups by usertype and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, weekday) # sorts
# Let's visualize the number of rides by rider type
all_trips_v2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, weekday) %>%
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")
# Let's visualize average duration by rider type
all_trips_v2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, weekday) %>%
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")
#-------------------------------------------------------------
# From here is my code - thanks to Hartman for the assistance
#-------------------------------------------------------------
# A visualization showing proportion of member/casual by bike type and year
all_trips_v2 %>%
  drop_na(member_casual, rideable_type) %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual)) +
  facet_wrap(~year)
# A visualization showing proportion of member/casual by month
all_trips_v2 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month) %>%
  ggplot(aes(x = month)) +
  geom_bar(aes(fill = member_casual))
# A visualization showing proportion of member/casual by month and year
all_trips_v2 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month) %>%
  ggplot(aes(x = month)) +
  geom_bar(aes(fill = member_casual)) +
  facet_wrap(~year)
# N.B. some of the aggregations mentioned above are not shown here and can be seen in later sections
# A visualization showing average duration by month and year, member/casual
all_trips_v2 %>%
  drop_na(member_casual, year, month) %>%
  group_by(member_casual, year, month) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, year, month) %>%
  ggplot(aes(x = month, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  facet_wrap(~year)
# Visualizations of proportion of members/casual by type of bike for each month in 2020
# For 2020, I used the monthly tables
june_2020 %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
july_2020 %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
aug_2020 %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
sept_2020 %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
oct_2020 %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
nov_2020 %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
dec_2020 %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
# For 2021, I filtered the all_trips table
all_trips_v2 %>%
  filter(month == "01" & year == "2021") %>%  # restrict to 2021; these months also exist in earlier years
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
all_trips_v2 %>%
  filter(month == "02" & year == "2021") %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
all_trips_v2 %>%
  filter(month == "03" & year == "2021") %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
all_trips_v2 %>%
  filter(month == "04" & year == "2021") %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
all_trips_v2 %>%
  filter(month == "05" & year == "2021") %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
all_trips_v2 %>%
  filter(month == "06" & year == "2021") %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))
I was planning to explore this second data set further. However, at this point I decided to combine the data sets and fill in the gaps in order to get a complete picture of the effects of COVID and e-bikes. One main reason was that I noticed a proportional decrease in casual riders as summer turned to winter, but a proportional increase in 2021 as winter gave way to spring. I wanted to see the changes across a span of more than one year, so I created a data set of two years and one quarter.
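The combination step itself can be sketched as below. The source-table names are illustrative stand-ins for my actual objects, and the toy rows exist only so the chunk runs on its own; the real tables share these core columns after cleaning.

```r
library(dplyr)

# Toy stand-ins for the quarterly and monthly source tables (names assumed)
quarterly_trips <- tibble(
  started_at = as.POSIXct("2019-07-01 08:00:00", tz = "UTC"),
  ended_at   = as.POSIXct("2019-07-01 08:20:00", tz = "UTC"),
  member_casual = "member"
)
monthly_trips <- tibble(
  started_at = as.POSIXct("2020-06-15 17:00:00", tz = "UTC"),
  ended_at   = as.POSIXct("2020-06-15 17:45:00", tz = "UTC"),
  member_casual = "casual"
)

# Stack the sets, then recompute the derived columns on the combined frame
all_trips_combined <- bind_rows(quarterly_trips, monthly_trips) %>%
  mutate(ride_length = as.numeric(difftime(ended_at, started_at, units = "secs")),
         month = format(as.Date(started_at), "%m"),
         year  = format(as.Date(started_at), "%Y"))
```

Because bind_rows() matches columns by name, the quarterly and monthly exports only need consistent column names, not consistent column order.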
In the combined analysis, I retraced my steps, repeating the same cleaning and analysis to check the data (see the code above), and then jumped straight to a look at duration and rider count by month.
Looking at the visualization for monthly stats, I noticed some unusually high durations for January and February, as seen in the chart below. This is also very noticeable when the data is separated by year (chart not shown). This was puzzling because ride times would normally be lower in the cold winter months. After noticing this, I took a side-step to explore the source of this out-of-place result. The yearly chart (see the next section) pointed towards the beginning of 2020 as the place to look.
all_trips_v2 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, month) %>%
  ggplot(aes(x = month, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")
In the data aggregated by year, month, and member_casual, you can clearly see some very high values at the beginning of 2020 distorting the average duration.
#Stats by year/month
all_trips_v2 %>%
  drop_na(member_casual, month, year) %>%
  group_by(member_casual, month, year) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, month, year)
## # A tibble: 54 x 5
## # Groups: member_casual, month [24]
## member_casual month year number_of_rides average_duration
## <chr> <chr> <chr> <int> <dbl>
## 1 casual 01 2020 7785 9699.
## 2 casual 01 2021 15916 1628.
## 3 casual 02 2020 12314 7997.
## 4 casual 02 2021 9219 3160.
## 5 casual 03 2020 24615 4250.
## 6 casual 03 2021 78867 2362.
## 7 casual 04 2019 47744 3057.
## 8 casual 04 2020 23610 4388.
## 9 casual 04 2021 126181 2370.
## 10 casual 05 2019 81624 3074.
## # … with 44 more rows
Looking at the maximum values of ride_length in more detail below, the monthly data set (April 2020 onward) has some max values around 3,000,000 seconds (3M), but none higher than about 3.5M. In the quarterly data, all max values are over 2M, with a maximum duration as high as about 9M, nearly three times higher than the highest in the other data set. For perspective, that is equal to about 34 days versus 104 days. It seems odd that there would be a bike trip lasting 3 months, let alone 1 month!
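As a quick sanity check on those day figures, converting the round numbers quoted above from seconds to days:

```r
# 86,400 seconds in a day; convert the round figures quoted above
secs_to_days <- function(s) s / 86400

secs_to_days(3000000)  # about 34.7 days
secs_to_days(9000000)  # about 104.2 days
```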
Even though there are some high max values in other months, they seem to adversely affect only the winter months and casual users. This is not logical given the winter conditions of a northern city on Lake Michigan, and given that the stats for those months are lower on all major measures. I started to suspect that something was going on.
aggregate(all_trips_v2$ride_length ~ all_trips_v2$month + all_trips_v2$year, FUN = max)
## all_trips_v2$month all_trips_v2$year all_trips_v2$ride_length
## 1 04 2019 4757638
## 2 05 2019 4519226
## 3 06 2019 2348763
## 4 07 2019 9056634
## 5 08 2019 7667470
## 6 09 2019 4563322
## 7 10 2019 8582302
## 8 11 2019 3307285
## 9 12 2019 3246842
## 10 01 2020 9387024
## 11 02 2020 8636205
## 12 03 2020 5627611
## 13 04 2020 3523202
## 14 05 2020 1733813
## 15 06 2020 2476260
## 16 07 2020 2997927
## 17 08 2020 2450782
## 18 09 2020 3257001
## 19 10 2020 2143463
## 20 11 2020 2156040
## 21 12 2020 584459
## 22 01 2021 1189555
## 23 02 2021 1807754
## 24 03 2021 1900899
## 25 04 2021 2866602
## 26 05 2021 3235296
## 27 06 2021 3356649
# aggregate(all_trips_v2$ride_length ~ all_trips_v2$month + all_trips_v2$year, FUN = mean)
From here, I homed in on the first three months of the year, where the duration averages were unusual, suspecting that some outlier data points could be the source of the distortion. I compared some key stats for the first three months of 2020 and 2021, as seen in the table below. In the months with very high data points, the median and mean were very far apart, which suggests there are enough outliers to distort the average. I investigated further by looking at the density of the three months across the two years.
Comparing the jittered scatter plot for Q1 2020 with the one for Q1 2021, they clearly show a lot of distortion coming from casual riders in 2020. Note that in 2021 there is no distortion on the member side, and all data points are under 2M, with most outliers under 500K on the casual side. This is not the case in 2020, where there are comparatively many data points over 500K. This reinforced my suspicion that the outlier data points were not actual trips and were skewing the data.
all_trips_v2 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month, year) %>%
  summarise(count = n(), median = median(ride_length),
            mean = mean(ride_length), max = max(ride_length)) %>%
  filter(month == "01" | month == "02" | month == "03") %>%
  arrange(member_casual, month, year)
## # A tibble: 12 x 7
## # Groups: member_casual, month [6]
## member_casual month year count median mean max
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 casual 01 2020 7785 1028 9699. 9387024
## 2 casual 01 2021 15916 747 1628. 1189555
## 3 casual 02 2020 12314 1297 7997. 8636205
## 4 casual 02 2021 9219 998 3160. 1807754
## 5 casual 03 2020 24615 1549 4250. 3435010
## 6 casual 03 2021 78867 1154 2362. 1900899
## 7 member 01 2020 136099 493 669. 625452
## 8 member 01 2021 72291 526 778. 89997
## 9 member 02 2020 126714 494 768. 4753051
## 10 member 02 2021 36357 614 1105. 89997
## 11 member 03 2020 115593 570 860. 5627611
## 12 member 03 2021 134779 602 839. 93596
q1_2020$ride_length <- difftime(q1_2020$ended_at, q1_2020$started_at)
q1_2020$ride_length <- as.numeric(as.character(q1_2020$ride_length))
q1_2020$date <- as.Date(q1_2020$started_at)  # The default format is yyyy-mm-dd
q1_2020$month <- format(as.Date(q1_2020$date), "%m")
q1_2020 %>%
  group_by(month, ride_length) %>%
  ggplot(aes(x = month, y = ride_length, colour = member_casual)) +
  geom_jitter() +
  facet_wrap(~member_casual) +
  ggtitle("Jan to Mar 2020")
all_trips_v2 %>%
  filter(month == "01" & year == "2021" | month == "02" & year == "2021" | month == "03" & year == "2021") %>%
  ggplot(aes(x = month, y = ride_length, colour = member_casual)) +
  geom_jitter() +
  facet_wrap(~member_casual) +
  ggtitle("Jan to Mar 2021")
With the majority of the high data points coming from casual users, and having already dropped the maintenance stations from the data, I concluded that there must be some anomaly in the data: thefts, lost bikes, bikes not docked properly, a glitch in the software tracking the bikes, or something else causing high durations that casual riders did not intend.
In a real scenario, I would inquire further with the data team, the data engineers, or whoever is in charge of data collection about these high data points and their possible cause. For the purpose of this analysis, I will continue with my assumption.
From my research on the bike share system, day passes are valid for 24 hours and allow rides of up to 3 hours before the bike must be docked to avoid extra charges. Under the assumption that the high values distorting the data are not actual rides, I decided to see how removing data beyond these two benchmarks affected the key measures, mainly average duration.
I made three new data frames containing the monthly data aggregated by year and user type, called monthly_user_stats, monthly_user_stats_3hrs, and monthly_user_stats_24hrs. I also calculated how much data is lost when removing rides that exceed those duration benchmarks. As you can see below, the share of rides over 3 hours is only a little over half a percent, while the share of rides over 24 hours is negligible.
# Monthly table for all trips with duration greater than 0
monthly_user_stats <- all_trips_v2 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month, year) %>%
  summarise(total = n(), median_ride = median(ride_length), max_ride = max(ride_length),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, month, year) %>%
  rename(user = member_casual)
# Create a table for trips under 10800 seconds/3 hrs
monthly_user_stats_3hrs <- all_trips_v2 %>%
  filter(ride_length <= 10800) %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month, year) %>%
  summarise(total = n(), median_ride = median(ride_length), max_ride = max(ride_length),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, month, year) %>%
  rename(user = member_casual)
m <- nrow(filter(all_trips_v2, !(ride_length <= 10800)))
n <- nrow(all_trips_v2)
lossage <- m/n*100
capture.output(cat("The amount of data lost if we only include rides up to 3 hours:", lossage, "%"))
## [1] "The amount of data lost if we only include rides up to 3 hours: 0.5613332 %"
# Create a table for trips under 86400 seconds/24 hrs
monthly_user_stats_24hrs <- all_trips_v2 %>%
  filter(ride_length <= 86400) %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month, year) %>%
  summarise(total = n(), median_ride = median(ride_length), max_ride = max(ride_length),
            average_duration = mean(ride_length)) %>%
  arrange(member_casual, month, year) %>%
  rename(user = member_casual)
m <- nrow(filter(all_trips_v2, !(ride_length <= 86400)))
lossage <- m/n*100
capture.output(cat("The amount of data lost if we only include rides up to 24 hours:", lossage, "%"))
## [1] "The amount of data lost if we only include rides up to 24 hours: 0.06914472 %"
I then wanted to see how strongly these large ride times were impacting the ride length data. I created two monthly comparison tables that display the user type, the month and year, the number of rides, and the average duration of the whole data set compared to the reduced data sets at 3 hrs and 24 hrs.
# 3 hour monthly data frame comparison
monthly_stat_comparison_3hr <- bind_cols(monthly_user_stats, monthly_user_stats_3hrs)
monthly_stat_comparison_3hr <- monthly_stat_comparison_3hr %>%
  select(c(1, 2, 3, 4, 7, 11, 14)) %>%
  rename("user" = 1, "mo." = 2, "yr." = 3, "rides" = 4, "avg_dur" = 5, "rides_3hr" = 6, "avg_dur_3hr" = 7)
monthly_stat_comparison_3hr$diff_rides <- (monthly_stat_comparison_3hr$rides - monthly_stat_comparison_3hr$rides_3hr)
monthly_stat_comparison_3hr$diff_dur <- (monthly_stat_comparison_3hr$avg_dur - monthly_stat_comparison_3hr$avg_dur_3hr)
# 24 hour monthly data frame comparison
monthly_stat_comparison_24hr <- bind_cols(monthly_user_stats, monthly_user_stats_24hrs)
monthly_stat_comparison_24hr <- monthly_stat_comparison_24hr %>%
  select(c(1, 2, 3, 4, 7, 11, 14)) %>%
  rename("user" = 1, "mo." = 2, "yr." = 3, "rides" = 4, "avg_dur" = 5, "rides_24hr" = 6, "avg_dur_24hr" = 7)
monthly_stat_comparison_24hr$diff_rides <- (monthly_stat_comparison_24hr$rides - monthly_stat_comparison_24hr$rides_24hr)
monthly_stat_comparison_24hr$diff_dur <- (monthly_stat_comparison_24hr$avg_dur - monthly_stat_comparison_24hr$avg_dur_24hr)
print(monthly_stat_comparison_3hr, n = 54)
## # A tibble: 54 x 9
## user mo. yr. rides avg_dur rides_3hr avg_dur_3hr diff_rides diff_dur
## <chr> <chr> <chr> <int> <dbl> <int> <dbl> <int> <dbl>
## 1 casual 01 2020 7785 9699. 7504 1452. 281 8247.
## 2 casual 01 2021 15916 1628. 15798 1157. 118 471.
## 3 casual 02 2020 12314 7997. 11991 1759. 323 6238.
## 4 casual 02 2021 9219 3160. 9038 1509. 181 1651.
## 5 casual 03 2020 24615 4250. 24136 2048. 479 2202.
## 6 casual 03 2021 78867 2362. 77980 1702. 887 660.
## 7 casual 04 2019 47744 3057. 47278 2257. 466 799.
## 8 casual 04 2020 23610 4388. 23198 2031. 412 2356.
## 9 casual 04 2021 126181 2370. 124708 1691. 1473 679.
## 10 casual 05 2019 81624 3074. 80617 2267. 1007 807.
## 11 casual 05 2020 86844 3073. 84781 2222. 2063 851.
## 12 casual 05 2021 231178 2416. 228255 1734. 2923 682.
## 13 casual 06 2019 130218 2755. 128625 2195. 1593 560.
## 14 casual 06 2020 154551 3100. 151063 2115. 3488 985.
## 15 casual 06 2021 327831 2361. 323835 1605. 3996 756.
## 16 casual 07 2019 175632 3587. 173360 2219. 2272 1368.
## 17 casual 07 2020 268606 3598. 261867 2215. 6739 1383.
## 18 casual 08 2019 186889 4020. 184345 2168. 2544 1852.
## 19 casual 08 2020 285379 2708. 280653 1959. 4726 749.
## 20 casual 09 2019 129173 3100. 127588 2028. 1585 1072.
## 21 casual 09 2020 221367 2330. 218867 1743. 2500 586.
## 22 casual 10 2019 71035 3540. 70002 1827. 1033 1712.
## 23 casual 10 2020 131259 1886. 130180 1503. 1079 383.
## 24 casual 11 2019 18723 4022. 18322 1513. 401 2509.
## 25 casual 11 2020 78556 2004. 77914 1577. 642 427.
## 26 casual 12 2019 16430 3800. 16098 1698. 332 2102.
## 27 casual 12 2020 26449 1699. 26257 1274. 192 425.
## 28 member 01 2020 136099 669. 136005 624. 94 44.7
## 29 member 01 2021 72291 778. 72142 717. 149 61.9
## 30 member 02 2020 126714 768. 126631 637. 83 132.
## 31 member 02 2021 36357 1105. 36131 829. 226 276.
## 32 member 03 2020 115593 860. 115483 751. 110 110.
## 33 member 03 2021 134779 839. 134639 807. 140 32.2
## 34 member 04 2019 217566 811. 217386 727. 180 83.3
## 35 member 04 2020 61115 1289. 60991 1029. 124 260.
## 36 member 04 2021 184989 882. 184733 838. 256 43.8
## 37 member 05 2019 285834 831. 285556 769. 278 61.7
## 38 member 05 2020 113258 1186. 113028 1117. 230 69.4
## 39 member 05 2021 246710 885. 246378 843. 332 42.0
## 40 member 06 2019 345177 873. 344793 814. 384 58.9
## 41 member 06 2020 187985 1124. 187648 1055. 337 68.8
## 42 member 06 2021 321667 882. 321209 834. 458 48.0
## 43 member 07 2019 381683 986. 381325 838. 358 148.
## 44 member 07 2020 280980 1066. 280512 1005. 468 61.3
## 45 member 08 2019 403295 971. 402999 814. 296 157.
## 46 member 08 2020 326618 1010. 326216 949. 402 61.3
## 47 member 09 2019 364046 848. 363826 775. 220 72.6
## 48 member 09 2020 289768 933. 289400 877. 368 56.8
## 49 member 10 2019 300751 782. 300563 703. 188 79.0
## 50 member 10 2020 224285 851. 223941 797. 344 54.4
## 51 member 11 2019 158440 746. 158300 644. 140 102.
## 52 member 11 2020 155976 824. 155832 796. 144 28.3
## 53 member 12 2019 138662 685. 138555 640. 107 44.7
## 54 member 12 2020 93033 770. 92924 731. 109 38.6
print(monthly_stat_comparison_24hr, n = 54)
## # A tibble: 54 x 9
## user mo. yr. rides avg_dur rides_24hr avg_dur_24hr diff_rides diff_dur
## <chr> <chr> <chr> <int> <dbl> <int> <dbl> <int> <dbl>
## 1 casual 01 2020 7785 9699. 7721 2261. 64 7438.
## 2 casual 01 2021 15916 1628. 15889 1334. 27 293.
## 3 casual 02 2020 12314 7997. 12251 2355. 63 5642.
## 4 casual 02 2021 9219 3160. 9161 1905. 58 1255.
## 5 casual 03 2020 24615 4250. 24515 2464. 100 1786.
## 6 casual 03 2021 78867 2362. 78715 1932. 152 430.
## 7 casual 04 2019 47744 3057. 47669 2452. 75 605.
## 8 casual 04 2020 23610 4388. 23507 2366. 103 2022.
## 9 casual 04 2021 126181 2370. 125950 1921. 231 449.
## 10 casual 05 2019 81624 3074. 81507 2495. 117 579.
## 11 casual 05 2020 86844 3073. 86666 2669. 178 404.
## 12 casual 05 2021 231178 2416. 230770 1976. 408 440.
## 13 casual 06 2019 130218 2755. 130066 2425. 152 330.
## 14 casual 06 2020 154551 3100. 154216 2543. 335 557.
## 15 casual 06 2021 327831 2361. 327230 1834. 601 526.
## 16 casual 07 2019 175632 3587. 175433 2437. 199 1150.
## 17 casual 07 2020 268606 3598. 267939 2678. 667 919.
## 18 casual 08 2019 186889 4020. 186625 2409. 264 1611.
## 19 casual 08 2020 285379 2708. 284923 2260. 456 449.
## 20 casual 09 2019 129173 3100. 128987 2268. 186 832.
## 21 casual 09 2020 221367 2330. 221095 1947. 272 383.
## 22 casual 10 2019 71035 3540. 70889 2129. 146 1410.
## 23 casual 10 2020 131259 1886. 131098 1670. 161 216.
## 24 casual 11 2019 18723 4022. 18647 2058. 76 1964.
## 25 casual 11 2020 78556 2004. 78465 1735. 91 269.
## 26 casual 12 2019 16430 3800. 16365 2231. 65 1569.
## 27 casual 12 2020 26449 1699. 26408 1422. 41 278.
## 28 member 01 2020 136099 669. 136082 646. 17 22.5
## 29 member 01 2021 72291 778. 72281 766. 10 12.3
## 30 member 02 2020 126714 768. 126695 659. 19 110.
## 31 member 02 2021 36357 1105. 36298 960. 59 144.
## 32 member 03 2020 115593 860. 115566 776. 27 84.5
## 33 member 03 2021 134779 839. 134769 833. 10 6.65
## 34 member 04 2019 217566 811. 217531 748. 35 62.6
## 35 member 04 2020 61115 1289. 61095 1082. 20 206.
## 36 member 04 2021 184989 882. 184965 871. 24 11.5
## 37 member 05 2019 285834 831. 285793 794. 41 36.9
## 38 member 05 2020 113258 1186. 113236 1160. 22 25.9
## 39 member 05 2021 246710 885. 246676 873. 34 12.2
## 40 member 06 2019 345177 873. 345135 843. 42 30.2
## 41 member 06 2020 187985 1124. 187963 1098. 22 26.0
## 42 member 06 2021 321667 882. 321603 864. 64 17.7
## 43 member 07 2019 381683 986. 381615 861. 68 125.
## 44 member 07 2020 280980 1066. 280936 1042. 44 24.3
## 45 member 08 2019 403295 971. 403241 832. 54 139.
## 46 member 08 2020 326618 1010. 326579 976. 39 34.3
## 47 member 09 2019 364046 848. 364004 792. 42 56.3
## 48 member 09 2020 289768 933. 289729 901. 39 32.8
## 49 member 10 2019 300751 782. 300717 720. 34 61.8
## 50 member 10 2020 224285 851. 224241 830. 44 21.6
## 51 member 11 2019 158440 746. 158394 666. 46 79.9
## 52 member 11 2020 155976 824. 155956 813. 20 11.4
## 53 member 12 2019 138662 685. 138647 663. 15 22.3
## 54 member 12 2020 93033 770. 93021 758. 12 11.5
compare_3hrs_to_24_hrs_duration <- c(monthly_stat_comparison_24hr$diff_dur - monthly_stat_comparison_3hr$diff_dur)
compare_3hrs_to_24_hrs_duration
## [1] -808.95984 -177.41104 -595.82557 -395.74891 -416.47501 -229.99131
## [7] -194.48985 -334.35080 -230.05472 -228.01917 -447.14234 -242.40478
## [13] -230.67174 -427.97375 -229.75620 -218.39610 -463.71704 -240.71106
## [19] -300.62578 -240.70977 -203.62176 -302.31227 -166.68854 -545.26135
## [25] -158.05707 -532.52753 -147.29640 -22.13153 -49.60210 -22.28224
## [31] -131.29384 -25.03519 -25.52783 -20.74420 -53.46343 -32.29837
## [37] -24.83281 -43.48702 -29.79099 -28.66920 -42.85319 -30.24563
## [43] -23.05305 -36.96603 -18.19795 -26.96707 -16.30551 -24.03485
## [49] -17.19730 -32.75682 -22.01300 -16.86407 -22.38140 -27.10596
On the whole, the average duration changes very little, as seen in the tables above. Looking at the 3-hour data for members, all changes in the average are under 5 minutes (see the column titled diff_dur). For casual users, the change ranges from roughly 7 to 40 minutes, the exceptions being two huge differences for Jan and Feb 2020, as the earlier Jan to Mar 2020 scatter plot suggested. Almost all larger changes occur in the quarterly data set before April 2020. In addition, the difference in ride counts ranges from 83 to 6739, with all member differences in the hundreds. Again, this shows that only a very small percentage of total rides exceed 3 hours, and the vast majority of those come from casual users.
Taking the 24-hour data into consideration, we see the same general trends in the differences. Narrowing the data has almost no effect on average member duration. The difference of the differences ranges from a maximum of less than 15 minutes down to a minimum of about 16 seconds, with the biggest differences again in Q1 2020. In other words, tightening the cutoff from 24-hour trips to 3-hour trips does not produce a major change in average duration. The difference in the number of rides for trips under 24 hours compared to the whole data set varies from 10 to 667. There are so few rides over 24 hours that the ones that remain distort the data and do not accurately represent how users ride the bikes. This suggests to me that something changed in how the bikes were managed, or in how the data was collected; again, I would inquire further into this if it were not a role-play. Eliminating the noise in the long tails would give us a better picture of how both member and casual users use the bike share service.
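The spread claim can be checked directly. The values below are a few representative entries copied from the printed vector above, including the largest and smallest in absolute terms:

```r
# Representative entries from compare_3hrs_to_24_hrs_duration, in seconds
diffs <- c(-808.95984, -177.41104, -22.13153, -16.30551)

range(abs(diffs))       # about 16.3 to 809 seconds
range(abs(diffs)) / 60  # about 0.27 to 13.5 minutes
```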
This noise can be confirmed with the density plots below. For trips up to 24 hours, the data still has a huge right skew. Ninety-nine percent of casual riders (bandwidth of about 60 seconds) use a bike for about 3 hours and 20 minutes or less; the 99th percentile for members (bandwidth of about 20 seconds) is about 50 minutes. Comparing the 24-hour chart to the 3-hour chart, there is still a positive skew, and for members the 99th percentile is virtually the same. For casual users, 95% of rides are around 100 minutes or less in both cases.
# Data Frames that filter out trip rides over 24 hours
member_24 <- all_trips_v2 %>%
  filter(member_casual == "member" & ride_length < 86400) %>%
  drop_na(ride_length)
casual_24 <- all_trips_v2 %>%
  filter(member_casual == "casual" & ride_length < 86400) %>%
  drop_na(ride_length)
# Density Plots and Key Percentiles for 24 hrs
member_24 %>%
  ggplot(aes(x = ride_length)) +
  geom_density(fill = "#69b3a2", color = "#e9ecef", alpha = 0.8) +
  scale_y_continuous(labels = scales::comma) +
  ggtitle("Density of Members: Rides up to 24 Hrs")
#density_member_24 <- density(member_24$ride_length)
#plot(density_member_24, main = "Density of Members: Rides up to 24 Hrs")
"Quantiles for Members with Ride Length under 24-Hours"
## [1] "Quantiles for Members with Ride Length under 24-Hours"
quantile(member_24$ride_length, c(.5, .75, .9, .95, .99))
## 50% 75% 90% 95% 99%
## 628 1065 1662 2084 3092
casual_24 %>%
  ggplot(aes(x = ride_length)) +
  geom_density(fill = "#69b3a2", color = "#e9ecef", alpha = 0.8) +
  scale_y_continuous(labels = scales::comma) +
  ggtitle("Density of Casual Riders: Rides up to 24 Hrs")
#density_casual_24 <- density(casual_24$ride_length)
#plot(density_casual_24, main = "Density of Casual Riders: Rides up to 24 Hrs")
"Quantiles for Casual Riders with Ride Length under 24-Hours"
## [1] "Quantiles for Casual Riders with Ride Length under 24-Hours"
quantile(casual_24$ride_length, c(.5, .75, .9, .95, .99))
## 50% 75% 90% 95% 99%
## 1331 2434 4610 6542 11939
# Data Frames that filter out trip rides over 3 hours
member_3 <- all_trips_v2 %>%
  filter(member_casual == "member" & ride_length < 10800) %>%
  drop_na(ride_length)
casual_3 <- all_trips_v2 %>%
  filter(member_casual == "casual" & ride_length < 10800) %>%
  drop_na(ride_length)
# Density Plots and Key Percentiles for 3 hrs
member_3 %>%
  ggplot(aes(x = ride_length)) +
  geom_density(fill = "#69b3a2", color = "#e9ecef", alpha = 0.8) +
  scale_y_continuous(labels = scales::comma) +
  ggtitle("Density of Members: Rides up to 3 Hrs")
#density_member_3 <- density(member_3$ride_length)
#plot(density_member_3, main = "Density of Members: Rides up to 3 Hrs")
"Quantiles for Members with Ride Length under 3-Hours"
## [1] "Quantiles for Members with Ride Length under 3-Hours"
quantile(member_3$ride_length, c(.5, .75, .9, .95, .99))
## 50% 75% 90% 95% 99%
## 627 1063 1656 2074 2998
casual_3 %>%
  ggplot(aes(x = ride_length)) +
  geom_density(fill = "#69b3a2", color = "#e9ecef", alpha = 0.8) +
  scale_y_continuous(labels = scales::comma) +
  ggtitle("Density of Casual Riders: Rides up to 3 Hrs")
#density_casual_3 <- density(casual_3$ride_length)
#plot(density_casual_3, main = "Density of Casual Riders: Rides up to 3 Hrs")
"Quantiles for Casual Riders with Ride Length under 3-Hours"
## [1] "Quantiles for Casual Riders with Ride Length under 3-Hours"
quantile(casual_3$ride_length, c(.5, .75, .9, .95, .99))
## 50% 75% 90% 95% 99%
## 1313 2356 4324 5905 8959
Given the big skew of the distribution when all recorded trips are considered, I concluded that reducing the data set would give a more accurate picture of how members and casual users differ. Only about 1 in 180 rides were over 3 hours, and only about 1 in 1450 were over 24 hours, so I decided to limit my data to rides under 3 hours.
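The “1 in N” figures follow directly from the loss percentages computed earlier:

```r
# Percentages of rides removed at each cutoff, from the lossage calculations above
pct_over_3hr  <- 0.5613332
pct_over_24hr <- 0.06914472

round(100 / pct_over_3hr)   # 178, i.e. roughly 1 in 180 rides over 3 hours
round(100 / pct_over_24hr)  # 1446, i.e. roughly 1 in 1450 rides over 24 hours
```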
The most important insights are the increased ridership on weekends by casual users and a more even use across all days of the week by members. I think this is a result of “isolation fever” and a shifting mindset: people want to get out of the house and enjoy their time, while also changing the way they work. They also want to avoid COVID exposure in indoor settings. With the risk of serious infection in an outdoor environment seeming lower than thought at the beginning of the pandemic, people were eager to get out and enjoy the nice weather in Chicago. If this is the case, we should focus on this demographic and encourage them to purchase an annual membership.
To convert casual riders to members, we should create three new campaigns. The first is to give one month free to users who participate in our social media campaign. The second is a limited-time offer of $99 for a yearly membership. The third is a shortened, 6-month trial membership. These efforts should increase sales of annual memberships to casual users.
To achieve this end, we will promote the Ride, Tag, and Play campaign. This social media campaign is aimed at two sorts of casual riders: those who use the bike share program for single rides, such as occasional commuting or other business purposes, and those who use the system for weekend recreation. The goal is to emphasize that an annual membership covers both, any time of the year, stress-free. Single-ride users will not have to worry about a 30-minute time limit, nor will they need a 3-hour time cushion. The accompanying slogan is “Go anywhere in Chicago in under 45 minutes!”. It targets riders who, even if they usually purchase a day pass, will know that in most cases 45 minutes is enough time. And the best part of the pass is that users can use it any time, all year round. It will let them enjoy fun-filled seasonal events without the hassle of purchasing non-member rides.
To be eligible for the one month free discount, users must @-mention “Cyclistic Bike Share” or tag “Cyclistic”, “Bike share”, or “ride and play” in an Instagram post about where they want to go or have already ridden. For example, “Going downtown to catch the Cubs game” or “Beat the traffic jam and got to work 15 minutes early”. Any non-member who participates in the campaign is eligible for the $11 discount on the annual pass at any time. We want our bike share program trending on social media. If possible, another idea would be to work with local businesses and offer promotions, freebies, discounts, or other benefits. If casual riders tag and share their experience and then purchase a pass through the campaign, they are also eligible for prizes or coupons at their favourite local stores or brands. This promotion would also apply to members who tag the system and help get the hashtags trending, widening public awareness of the bike sharing system.
The early-bird promotion will run from January to March under the slogan “Buy your annual membership early and have it ready for the Spring”. The early-bird special will be available to all new members, as well as to existing members, whose renewal period would shift and whose remaining membership months would be deducted from the price. Encouraging users to purchase a membership in the winter months will get them focused on using it in the Spring and Summer and will nudge casual users to convert. The idea of unlimited use should motivate those users to upgrade.
A final campaign idea is a 6-month trial for about $12 per month, or $72. This way, casual riders who are on the fence can purchase a reasonably priced pass and try out the system. If they see the benefits of the trial, they will be more likely to buy a regular or discounted annual membership ($144) soon after the trial finishes. This campaign will only be available from the end of August to the end of the year. If a casual rider purchases one at the start of that window, they will see the benefits of riding in the Fall, and the trial will end just as the early-bird period begins. If they purchase at the end of the year, they will see the benefits of riding in the Spring and early Summer, and will want to keep the perks of the pass for the rest of the summer.
I believe these campaigns will successfully boost membership sales and convert many casual riders to members.
# all_trips_v3 %>%
# filter(ride_length < 4140) %>%
# ggplot( aes(x=ride_length, fill = member_casual)) +
# geom_histogram(binwidth = 120) +
# scale_y_continuous(labels = scales::comma)
# all_trips_v3 %>%
# filter(ride_length >= 1800) %>%
# ggplot( aes(x=ride_length, fill = member_casual)) +
# geom_histogram(binwidth = 1800) +
# scale_y_continuous(labels = scales::comma) +
# labs( title = "Count of Total Duration 30 Minutes to 3 hour 30 minutes, by 30-minute Rides", caption = "Note: the first bin includes ride length 30 to 45 minutes")
# all_trips_v3$year_month <- format(as.Date(all_trips_v3$date), "%Y/%m")
# __________________________________________________
# all_trips_v3 %>%
# drop_na(member_casual, rideable_type) %>%
# group_by(member_casual, rideable_type) %>%
# summarise(number_of_rides = n()) %>%
# arrange(member_casual, rideable_type)
# # print(n = Inf)
# #
# all_trips_v3 %>%
# drop_na(year_month, member_casual, rideable_type) %>%
# group_by(year_month, member_casual, rideable_type) %>%
# summarise(number_of_rides = n()) %>%
# ggplot(aes(x = year_month, y = number_of_rides, group = rideable_type, colour = rideable_type)) +
# geom_line() +
# facet_wrap(~member_casual) +
# theme(axis.ticks = element_blank(), axis.text.x = element_blank())
# #
# all_trips_v3 %>%
# drop_na(year, month, rideable_type) %>%
# group_by(year, month, rideable_type) %>%
# summarise(number_of_rides = n()) %>%
# ggplot() +
# geom_bar(aes(x = month, y = number_of_rides), stat = "identity", fill="#CCFFFF", width=0.5) +
# #geom_point(aes(x = month, y = number_of_rides, group=rideable_type, colour = rideable_type)) +
# geom_line(aes(x = month, y = number_of_rides, group=rideable_type, colour = rideable_type)) +
# scale_colour_manual(values=c("#0072B2","#009E73", "#D55E00")) +
# #scale_fill_hue(c=45,l=80) +
# facet_wrap(~year)